Sample Selection Bias Correction Theory

Authors

  • Corinna Cortes
  • Mehryar Mohri
  • Michael Riley
  • Afshin Rostamizadeh
Abstract

This paper presents a theoretical analysis of sample selection bias correction. The sample bias correction technique commonly used in machine learning consists of reweighting the cost of an error on each training point of a biased sample to more closely reflect the unbiased distribution. This relies on weights derived by various estimation techniques based on finite samples. We analyze the effect of an error in that estimation on the accuracy of the hypothesis returned by the learning algorithm for two estimation techniques: a cluster-based estimation technique and kernel mean matching. We also report the results of sample bias correction experiments with several data sets using these techniques. Our analysis is based on the novel concept of distributional stability which generalizes the existing concept of point-based stability. Much of our work and proof techniques can be used to analyze other importance weighting techniques and their effect on accuracy when using a distributionally stable algorithm.
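As a rough illustration of this reweighting idea (not the paper's own algorithm or analysis), the sketch below builds a cluster-style weight estimate by binning a one-dimensional feature, weights each biased training point by the ratio of unbiased to biased bin frequencies, and plugs those weights into a weighted ridge least-squares fit. The data, bin edges, and regularization constant are all assumptions made for the example.

    import numpy as np

    # Synthetic example of importance-weighted training: the weights are
    # estimated from finite samples, so they are themselves noisy -- the
    # situation whose effect on the learned hypothesis the paper analyzes.
    rng = np.random.default_rng(0)

    # "True" (unbiased) feature distribution vs. a selection-biased sample.
    X_unbiased = rng.normal(loc=0.0, scale=1.0, size=(2000, 1))
    X_biased = rng.normal(loc=1.0, scale=0.7, size=(500, 1))
    y_biased = np.sin(X_biased[:, 0]) + 0.1 * rng.normal(size=500)

    # Cluster-style weight estimation (simplified to histogram bins):
    # weight each biased point by the ratio of unbiased to biased bin mass.
    bins = np.linspace(-4.0, 4.0, 17)
    p_unbiased, _ = np.histogram(X_unbiased, bins=bins, density=True)
    p_biased, _ = np.histogram(X_biased, bins=bins, density=True)
    idx = np.clip(np.digitize(X_biased[:, 0], bins) - 1, 0, len(bins) - 2)
    w = p_unbiased[idx] / np.maximum(p_biased[idx], 1e-12)

    # Importance-weighted ridge regression: the cost of an error on each
    # training point is reweighted by its estimated importance weight.
    Phi = np.hstack([X_biased, np.ones_like(X_biased)])
    A = Phi.T @ (w[:, None] * Phi) + 1e-3 * np.eye(2)
    b = Phi.T @ (w * y_biased)
    theta = np.linalg.solve(A, b)
    print("importance-weighted linear fit:", theta)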


Similar articles

Correcting sample selection bias in maximum entropy density estimation

We study the problem of maximum entropy density estimation in the presence of known sample selection bias. We propose three bias correction approaches. The first one takes advantage of unbiased sufficient statistics which can be obtained from biased samples. The second one estimates the biased distribution and then factors the bias out. The third one approximates the second by only using sample...


Bias Correction in Small Sample from Big Data

This paper discusses the bias problem when estimating the population size of big data such as online social networks (OSN) using uniform random sampling and simple random walk. Unlike the traditional estimation problem where the sample size is not very small relative to the data size, in big data a small sample relative to the data size is already very large and costly to obtain. We point out t...


Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising

In search advertising, the search engine needs to select the most profitable advertisements to display, which can be formulated as an instance of online learning with partial feedback, also known as the stochastic multi-armed bandit (MAB) problem. In this paper, we show that the naive application of MAB algorithms to search advertising for advertisement selection will produce sample selection b...


The effects of sample selection bias on racial differences in child abuse reporting.

OBJECTIVE: The aim was to examine whether design features of Wave 1, 1980 National Incidence Study (NIS) data resulted in sample selection bias when certain victims of maltreatment were excluded. METHOD: Logistic regression models for the probability of child abuse reports to the child protective services (CPS) were estimated using maximum likelihood methods for Black (n = 511) and White (n = 2...


The Economic Value of Reject Inference in Credit Scoring

We use data with complete information on both rejected and accepted bank loan applicants to estimate the value of sample bias correction using Heckman’s two-stage model with partial observability. In the credit scoring domain such correction is called reject inference. We validate the model performances with and without the correction of sample bias by various measurements. Results show that it...

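The last entry above mentions Heckman's two-stage model for reject inference. The sketch below is a hypothetical illustration of the standard two-step correction on synthetic data, not the partial-observability variant used in that paper: a probit model of loan acceptance supplies an inverse Mills ratio, which then enters an outcome regression fitted on accepted applicants only. All variable names and parameter values are assumptions made for the example.

    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import norm

    # Synthetic credit-scoring data: the outcome is observed only for
    # accepted applicants, and the selection and outcome errors are
    # correlated, which is what biases a naive regression.
    rng = np.random.default_rng(0)
    n = 5000
    income = rng.normal(size=n)
    score = rng.normal(size=n)
    errs = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)

    accept = (0.8 * income + 0.5 * score + errs[:, 0] > 0).astype(int)
    y = 1.0 + 0.6 * income + errs[:, 1]          # e.g., repayment performance

    # Stage 1: probit model of acceptance, then the inverse Mills ratio.
    Z = sm.add_constant(np.column_stack([income, score]))
    probit = sm.Probit(accept, Z).fit(disp=0)
    xb = Z @ probit.params
    mills = norm.pdf(xb) / norm.cdf(xb)

    # Stage 2: outcome regression on accepted applicants only, augmented
    # with the inverse Mills ratio to correct the selection bias.
    m = accept == 1
    X = sm.add_constant(np.column_stack([income[m], mills[m]]))
    ols = sm.OLS(y[m], X).fit()
    print(ols.params)   # intercept, income effect, selection-correction term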


Journal title:

Volume   Issue

Pages  -

Publication date: 2008